{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 4\n",
    "---\n",
    "# Handling Numerical Data\n",
    "\n",
    "### 4.0 Introduction\n",
    "Quantitative data is the measurment of something--weather class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 20 students, $529,392 in sales). In this chapter we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms\n",
    "\n",
    "### 4.1 Rescaling a feature\n",
    "Use scikit-learn's `MinMaxScaler` to rescale a feature array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.        ],\n",
       "       [0.28571429],\n",
       "       [0.35714286],\n",
       "       [0.42857143],\n",
       "       [1.        ]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn import preprocessing\n",
    "\n",
    "# create a feature\n",
    "feature = np.array([\n",
    "    [-500.5],\n",
    "    [-100.1],\n",
    "    [0],\n",
    "    [100.1],\n",
    "    [900.9]\n",
    "])\n",
    "\n",
    "# create scaler\n",
    "minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n",
    "\n",
    "# scale feature\n",
    "scaled_feature = minmax_scaler.fit_transform(feature)\n",
    "\n",
    "scaled_feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion\n",
    "Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simlest is called *min-max scaling*. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specfically, min-max calculates:\n",
    "$$\n",
    "x_i^` = \\frac{x_i - min(x)}{max(x) - min(x)}\n",
    "$$\n",
    "\n",
    "where x is the feature vector, $x_i$ is an individual element of feature x, and $x_i^`$ is the rescaled element\n",
    "\n",
    "#### See Also\n",
    "* Feature scaling, wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)\n",
    "\n",
    "### 4.2 Standardizing a Feature\n",
    "scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.76058269],\n",
       "       [-0.54177196],\n",
       "       [-0.35009716],\n",
       "       [-0.32271504],\n",
       "       [ 1.97516685]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn import preprocessing\n",
    "\n",
    "# create a feature\n",
    "feature = np.array([\n",
    "    [-1000.1],\n",
    "    [-200.2],\n",
    "    [500.5],\n",
    "    [600.6],\n",
    "    [9000.9]\n",
    "])\n",
    "\n",
    "# create scaler\n",
    "scaler = preprocessing.StandardScaler()\n",
    "\n",
    "# transform the feature\n",
    "standardized = scaler.fit_transform(feature)\n",
    "\n",
    "standardized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion\n",
    "A common alternative to min-max scaling is rescaling of features to be approximately standard normally distributed. To achieve this, we use standardization to tranform the data such that it has a mean, $\\bar x$, or 0 and a standard deviation $\\sigma$, of 1. Specifically, each element in the feature is transformed so that:\n",
    "$$\n",
    "x_i^` = \\frac{x_i - \\bar x}{\\sigma}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where $x_I^`$ is our standardized form of $x_i$. The transformed feature represents the number of standard deviations in the original value is away from the feature's mean value (also called a *z-score* in statistics)\n",
    "\n",
    "Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural netwroks. As a general rule, I'd recommend defauling to standardization unless you have a specific reason to use an alternative.\n",
    "\n",
    "We can see the effect of standardization by looking at the mean and standard deviation of our solutions output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean 0.0\n",
      "Standard Deviation: 1.0\n"
     ]
    }
   ],
   "source": [
    "print(\"Mean {}\".format(round(standardized.mean())))\n",
    "print(\"Standard Deviation: {}\".format(standardized.std()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If our data has significant outliers, it can negatively impact our standardizatino by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and quartile range. In scikit-learn, we do this using the *RobustScaler* method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-1.87387612],\n",
       "       [-0.875     ],\n",
       "       [ 0.        ],\n",
       "       [ 0.125     ],\n",
       "       [10.61488511]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create scaler\n",
    "robust_scaler = preprocessing.RobustScaler()\n",
    "\n",
    "# transform feature\n",
    "robust_scaler.fit_transform(feature)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Normalizing Observations\n",
    "Use scikit-learn's `Normalizer` to rescale the feature values to have unit norm (a total length of 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.70710678, 0.70710678],\n",
       "       [0.30782029, 0.95144452],\n",
       "       [0.07405353, 0.99725427],\n",
       "       [0.04733062, 0.99887928],\n",
       "       [0.95709822, 0.28976368]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import Normalizer\n",
    "\n",
    "# create feature matrix\n",
    "features = np.array([\n",
    "    [0.5, 0.5],\n",
    "    [1.1, 3.4],\n",
    "    [1.5, 20.2],\n",
    "    [1.63, 34.4],\n",
    "    [10.9, 3.3]\n",
    "])\n",
    "\n",
    "# create normalizer\n",
    "normalizer = Normalizer(norm=\"l2\")\n",
    "\n",
    "# transofmr feature matrix\n",
    "normalizer.transform(features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion\n",
    "Many rescaling methods operate of features; however, we can also rescale across individual observations. `Normalizer` rescales the values on individual observations to have unit norm (the sum of their lengths is 1). This type of rescaling is often used when we have many equivalent features (e.g. text-classification when every word is n-word group is a feature).\n",
    "\n",
    "`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:\n",
    "$$\n",
    "||x||_2 = \\sqrt{x_1^2 + x_2^2 + ... + x_n^2}\n",
    "$$\n",
    "\n",
    "where x is an individual observation and x_n is that observation's value for the nth feature.\n",
    "\n",
    "Alternatively, we can specify Manhattan norm (L1):\n",
    "$$\n",
    "||x||_1 = \\sum_{i=1}^n{x_i}\n",
    "$$\n",
    "\n",
    "Intuitively, L2 norm can be thought of as the distance between two poitns in New York for a bird (i.e. a straight line), while L1 can be thought of as the distance for a human wlaking on the street (walk north one block, east one block, north one block, east one block, etc), which is why it is called \"Manhattan norm\" or \"Taxicab norm\".\n",
    "\n",
    "Practically, notice that `norm='l1'` rescales an observation's values so they sum to 1, which can sometimes be a desirable quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sum of the first observation's values: 1.0\n"
     ]
    }
   ],
   "source": [
    "# transform feature matrix\n",
    "features_l1_norm = Normalizer(norm=\"l1\").transform(features)\n",
    "print(\"Sum of the first observation's values: {}\".format(features_l1_norm[0,0] + features_l1_norm[0,1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.9 Grouping Observations Using Clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_1</th>\n",
       "      <th>feature_2</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-9.877554</td>\n",
       "      <td>-3.336145</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-7.287210</td>\n",
       "      <td>-8.353986</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-6.943061</td>\n",
       "      <td>-7.023744</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-7.440167</td>\n",
       "      <td>-8.791959</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-6.641388</td>\n",
       "      <td>-8.075888</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   feature_1  feature_2  group\n",
       "0  -9.877554  -3.336145      0\n",
       "1  -7.287210  -8.353986      2\n",
       "2  -6.943061  -7.023744      2\n",
       "3  -7.440167  -8.791959      2\n",
       "4  -6.641388  -8.075888      2"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.datasets import make_blobs\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "features, _ = make_blobs(n_samples = 50,\n",
    "                         n_features = 2,\n",
    "                         centers = 3,\n",
    "                         random_state = 1)\n",
    "\n",
    "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n",
    "\n",
    "# make k-means clusterer\n",
    "clusterer = KMeans(3, random_state=0)\n",
    "\n",
    "# fit clusterer\n",
    "clusterer.fit(features)\n",
    "\n",
    "# predict values\n",
    "df['group'] = clusterer.predict(features)\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.10 Deleteing Observations with Missing Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.1, 11.1],\n",
       "       [ 2.2, 22.2],\n",
       "       [ 3.3, 33.3]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "features = np.array([\n",
    "    [1.1, 11.1],\n",
    "    [2.2, 22.2],\n",
    "    [3.3, 33.3],\n",
    "    [np.nan, 55]\n",
    "])\n",
    "\n",
    "# keep only observations that are not (denoted by ~) missing\n",
    "features[~np.isnan(features).any(axis=1)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_1</th>\n",
       "      <th>feature_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.1</td>\n",
       "      <td>11.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.2</td>\n",
       "      <td>22.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.3</td>\n",
       "      <td>33.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   feature_1  feature_2\n",
       "0        1.1       11.1\n",
       "1        2.2       22.2\n",
       "2        3.3       33.3"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame(features, columns=[\"feature_1\", \"feature_2\"])\n",
    "df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion\n",
    "Most machine learnign algorithms cannot handling any missing values in the target and feature arrays. The simplest solution is the delete every observation that contains one or more missing values\n",
    "\n",
    "There are three types of missing data:\n",
    "\n",
    "*Missing Completely At Random (MCAR)*\n",
    "* The probability that a value is missing is independent of everything.\n",
    "\n",
    "*Missing At Random (MAR)*\n",
    "* The probability that a value is missing is not completely random, but depends on information capture in other feature\n",
    "\n",
    "*Missing Not At Random (MNAR)*\n",
    "* The probability that a value is missing is not random and depends on information not captured in our features\n",
    "\n",
    "#### See Also\n",
    "* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)\n",
    "* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)\n",
    "\n",
    "### 4.11 Imputing Missing Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True Value: 0.8730186113995938\n",
      "Imputed Value: -3.058372724614996\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.datasets import make_blobs\n",
    "from sklearn.preprocessing import Imputer\n",
    "\n",
    "# make fake data\n",
    "features, _ = make_blobs(n_samples = 1000,\n",
    "                        n_features = 2,\n",
    "                        random_state = 1)\n",
    "\n",
    "# standardize the features\n",
    "scaler = StandardScaler()\n",
    "standardized_features = scaler.fit_transform(features)\n",
    "\n",
    "# replace the first feature's first value with a missing value\n",
    "true_value = standardized_features[0, 0]\n",
    "standardized_features[0,0] = np.nan\n",
    "\n",
    "# create imputer\n",
    "mean_imputer = Imputer(strategy=\"mean\", axis=0)\n",
    "\n",
    "# impute values\n",
    "feautres_mean_imputed = mean_imputer.fit_transform(features)\n",
    "\n",
    "# compare true and imputed values\n",
    "print(\"True Value: {}\".format(true_value))\n",
    "print(\"Imputed Value: {}\".format(feautres_mean_imputed[0,0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### See Also\n",
    "* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:machine_learning_cookbook]",
   "language": "python",
   "name": "conda-env-machine_learning_cookbook-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
